NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

Man, Yunze; Huang, De-An; Liu, Guilin; Sheng, Shiwei; Liu, Shilong; Gui, Liang-Yan; Kautz, Jan; Wang, Yu-Xiong; Yu, Zhiding (June 2025, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Free, publicly-accessible full text available June 11, 2026
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Zhao, Yue; Xue, Fuzhao; Reed, Scott; Fan, Linxi; Zhu, Yuke; Kautz, Jan; Yu, Zhiding; Krähenbühl, Philipp; Huang, De-An (February 2025, cs.CV)

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
more » « less
Free, publicly-accessible full text available February 7, 2026
EAGLE : Exploring the Design Space for Multi-modal LLMs with Mixture of Encoders

Shi, Min; Liu, Fuxiao; Wang, Shihao; Liao, Shijia; Radhakrishnan, Subhashree; Zhao, Yilin; Huang, De-an; Yin, Hongxu; Sapra, Karan; Yccoob, Yaser; et al (April 2025, ICLR 2025)

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
more » « less
Free, publicly-accessible full text available April 24, 2026
Flextron: Many-in-One Flexible Large Language Model

Cai, Ruisi; Muralidharan, Saurav; Heinrich, Greg; Yin, Hongxu; Wang, Zhangyang; Kautz, Jan; Molchanov, Pavlo (August 2024, https://doi.org/10.48550/arXiv.2406.10260)

raining modern large language models (LLMs) is extremely resource-intensive, and repeatedly customizing them for deployment scenarios with limited compute and memory is impractical. This paper introduces Flextron, a network architecture and post-training model optimization framework that supports flexible model deployment. Flextron uses a nested elastic structure that adapts rapidly to user-defined latency and accuracy targets during inference without requiring additional fine-tuning. It is also input-adaptive, automatically routing tokens through sub-networks for improved efficiency and performance. The authors propose a sample-efficient training method and routing algorithms to systematically transform an already-trained LLM into a Flextron model. Evaluation on the GPT-3 and LLaMA-2 families demonstrates Flextron’s superior performance over end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes only 7.63% of the tokens compared to original pretraining.
more » « less
Full Text Available
Convolutional State Space Models for Long-Range Spatiotemporal Modeling

Smith, Jimmy TH; De_Mello, Shalini; Kautz, Jan; Linderman, Scott W; Byeon, Wonmin (October 2023, arXiv)

Full Text Available
Convolutional Tensor-Train LSTM for Spatio-temporal Learning

Su, Jiahao; Byeon, Wonmin; Huang, Furong; Kautz, Jan; Anandkumar, Animashree (January 2020, Conference on Neural Information Processing Systems)

Full Text Available
Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection

https://doi.org/10.1109/CVPR42600.2020.01061

Ren, Zhongzheng; Yu, Zhiding; Yang, Xiaodong; Liu, Ming-Yu; Lee, Yong Jae; Schwing, Alexander G.; Kautz, Jan (June 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
null (Ed.)
Full Text Available
Reblur2Deblur: Deblurring videos via self-supervised learning

https://doi.org/10.1109/ICCPHOT.2018.8368468

Chen, Huaijin; Gu, Jinwei; Gallo, Orazio; Liu, Ming-Yu; Veeraraghavan, Ashok; Kautz, Jan (May 2018, IEEE International Conference on Computational Photography)

Full Text Available

Search for: All records